Network-aware optimization in distributed data stream management systems
Abstract
The management of streaming data in distributed environments is gaining importance in many application areas, such as sensor networks and e-science. This is mainly due to both the need for immediate reactions to important events in input streams and the requirement to efficiently handle the enormous data volumes generated, for example, by modern scientific experiments and observations. At the same time, data needs to be accessible to various collaborative, often geographically distributed communities and sciences. In this thesis, we address these issues by introducing a model and a prototype implementation of a distributed data stream management system (DSMS), and by devising network-aware optimization techniques for efficient resource usage in terms of computational load and network traffic in such a system. We use the term StreamGlobe to denote both the theoretical model of our DSMS and its actual prototype implementation. The prototype serves as a research platform for evaluating our optimization approaches, which are at the core of this thesis. We further use an application-specific astrophysical flavor of StreamGlobe called StarGlobe to demonstrate the applicability and effectiveness of distributed stream processing in an actual astrophysical e-science scenario.

Scarce resources such as computational power and network bandwidth limit the number of continuous queries a DSMS can handle concurrently. Making intelligent and efficient use of these resources is thus mandatory in order to offer users the best possible service. We achieve this goal by introducing data stream sharing, an optimization technique based on in-network query processing and multi-subscription optimization. In-network query processing enables us to distribute continuous query processing in the network, while multi-subscription optimization allows us to share data streams to satisfy multiple similar queries. Data stream sharing thus allows for efficient resource usage and can increase the number of queries a distributed DSMS is able to process concurrently with the available resources.

The effectiveness of data stream sharing depends on the existence of streams in the network that are suitable for sharing. If the available preprocessed result streams of previously registered queries do not contain all the data required by a new query, sharing must resort to the corresponding original streams to satisfy the new query. To alleviate this problem, we develop data stream widening, a technique that alters existing streams so that they additionally contain all the data needed by a new query. We introduce an abstract property tree (APT) and its extension, an abstract property forest (APF), for representing, matching, and merging queries and data in a distributed DSMS to enable the combination of data stream sharing and data stream widening. The improved representation of queries and streams allows for more effective optimization and additionally supports a larger class of queries.

Data stream widening requires the treatment of disjunctive predicates. However, traditional query optimization largely neglects the handling of such predicates. We therefore devise, compare, and discuss methods for matching and evaluating disjunctive predicates in the context of data stream sharing and data stream widening. The presented approaches are generic and thus applicable to other domains as well. Altogether, data stream sharing, data stream widening, and the methods for handling disjunctive predicates add up to a powerful optimization approach for continuous queries over data streams in a distributed DSMS such as StreamGlobe.
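As a rough illustration of how sharing, widening, and disjunctive predicates interact, the following Python sketch models the selection of a stream as a disjunction of closed ranges on a single numeric attribute. The `Interval` class and the `normalize`, `covers`, and `widen` helpers are hypothetical names introduced here for illustration; this is not the APT/APF representation of the thesis, only a simplified stand-in under the stated assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed range predicate lo <= x <= hi on one attribute (assumed model)."""
    lo: float
    hi: float

    def matches(self, x: float) -> bool:
        return self.lo <= x <= self.hi


def normalize(intervals):
    """Merge overlapping or touching ranges into a sorted, disjoint disjunction."""
    merged = []
    for iv in sorted(intervals, key=lambda i: (i.lo, i.hi)):
        if merged and iv.lo <= merged[-1].hi:
            last = merged[-1]
            merged[-1] = Interval(last.lo, max(last.hi, iv.hi))
        else:
            merged.append(iv)
    return merged


def covers(stream_pred, query: Interval) -> bool:
    """Sharing test: does the stream already carry every tuple the query needs?"""
    return any(iv.lo <= query.lo and query.hi <= iv.hi
               for iv in normalize(stream_pred))


def widen(stream_pred, query: Interval):
    """Widening: extend the stream's disjunction so that it also covers the query."""
    return normalize(list(stream_pred) + [query])


# An existing stream carries tuples with 0 <= x <= 10.
stream = [Interval(0, 10)]

q1 = Interval(5, 8)      # contained in the stream: can share it directly
q2 = Interval(20, 30)    # not contained: the stream must be widened first

shareable_before = covers(stream, q2)
widened = widen(stream, q2)          # disjunction: [0, 10] OR [20, 30]
shareable_after = covers(widened, q2)

# Each consumer still applies its own range as a residual filter.
tuples = [3, 7, 15, 25]
q1_result = [x for x in tuples if q1.matches(x)]
q2_result = [x for x in tuples if q2.matches(x)]
```

In this toy model, a query that is not covered forces either a fallback to the original stream or a widening of an existing one. After widening, the shared stream carries a superset of what each consumer needs, which is why every consumer keeps its own range as a residual filter.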
Similar resources
Communication-Aware Traffic Stream Optimization for Virtual Machine Placement in Cloud Datacenters with VL2 Topology
By pervasiveness of cloud computing, a colossal amount of applications from gigantic organizations increasingly tend to rely on cloud services. These demands caused a great number of applications in form of couple of virtual machines (VMs) requests to be executed on data centers’ servers. Some of applications are as big as not possible to be processed upon a single VM. Also, there exists severa...
A Quality-Centric Data Model for Distributed Stream Management Systems
It is challenging for large-scale stream management systems to return always perfect results when processing data streams originating from distributed sources. Data sources and intermediate processing nodes may fail during the lifetime of a stream query. In addition, individual nodes may become overloaded due to processing demands. In practice, users have to accept incomplete or inaccurate quer...
Quality-Aware Distributed Data Delivery for Continuous Query Services
We consider the problem of distributed continuous data delivery services in an overlay network of heterogeneous nodes. Each node in the system can be a source for any number of data streams and at the same time be a consumer node that is receiving streams sourced at other nodes. A consumer node may define a filter on a source stream such that only the desired portion of the stream is delivered,...
Optimum energy management strategy in smart distribution networks considering the effect of distributed generators and energy storage units
The penetration of distributed generation sources and energy storage units in distribution networks is increasing. Therefore, their impact on the reliability of the network is very necessary. In this study, in order to provide an optimal energy management strategy for smart distribution network, the multi-objective optimization problem of dynamic distribution feeder reconfiguration in the pres...
Tuple Routing Strategies for Distributed Eddies
Many applications that consist of streams of data are inherently distributed. Since input stream rates and other system parameters such as the amount of available computing resources can fluctuate significantly, a stream query plan must be able to adapt to these changes. Routing tuples between operators of a distributed stream query plan is used in several data stream management systems as an a...
Publication date: 2008